local policy
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.73)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
FOVA: Offline Federated Reinforcement Learning with Mixed-Quality Data
Qiao, Nan, Yue, Sheng, Ren, Ju, Zhang, Yaoxue
Offline Federated Reinforcement Learning (FRL), a marriage of federated learning and offline reinforcement learning, has attracted increasing interest recently. Albeit with some advancement, we find that the performance of most existing offline FRL methods drops dramatically when provided with mixed-quality data, that is, the logging behaviors (offline data) are collected by policies with varying qualities across clients. To overcome this limitation, this paper introduces a new vote-based offline FRL framework, named FOVA. It exploits a \emph{vote mechanism} to identify high-return actions during local policy evaluation, alleviating the negative effect of low-quality behaviors from diverse local learning policies. Besides, building on advantage-weighted regression (AWR), we construct consistent local and global training objectives, significantly enhancing the efficiency and stability of FOVA. Further, we conduct an extensive theoretical analysis and rigorously show that the policy learned by FOVA enjoys strict policy improvement over the behavioral policy. Extensive experiments corroborate the significant performance gains of our proposed algorithm over existing baselines on widely used benchmarks.
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- Information Technology > Security & Privacy (0.67)
- Education (0.66)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Hierarchical Decentralized Stochastic Control for Cyber-Physical Systems
Kaza, Kesav, Anantharaman, Ramachandran, Meshram, Rahul
This paper introduces a two-timescale hierarchical decentralized control architecture for Cyber-Physical Systems (CPS). The system consists of a global controller (GC), and N local controllers (LCs). The GC operates at a slower timescale, imposing budget constraints on the actions of LCs, which function at a faster timescale. Applications can be found in energy grid planning, wildfire management, and other decentralized resource allocation problems. We propose and analyze two optimization frameworks for this setting: COpt and FOpt. In COpt, both GC and LCs together optimize infinite-horizon discounted rewards, while in FOpt the LCs optimize finite-horizon episodic rewards, and the GC optimizes infinite-horizon rewards. Although both frameworks share identical reward functions, their differing horizons can lead to different optimal policies. In particular, FOpt grants greater autonomy to LCs by allowing their policies to be determined only by local objectives, unlike COpt. To our knowledge, these frameworks have not been studied in the literature. We establish the formulations, prove the existence of optimal policies, and prove the convergence of their value iteration algorithms. We further show that COpt always achieves a higher value function than FOpt and derive explicit bounds on their difference. Finally, we establish a set of sufficient structural conditions under which the two frameworks become equivalent.
- North America > United States > Massachusetts (0.04)
- North America > Canada > Ontario > National Capital Region > Ottawa (0.04)
- Europe > Netherlands > North Brabant > Eindhoven (0.04)
- Asia > India > Tamil Nadu > Chennai (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.69)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.47)
Local Policies Enable Zero-shot Long-horizon Manipulation
Dalal, Murtaza, Liu, Min, Talbott, Walter, Chen, Chen, Pathak, Deepak, Zhang, Jian, Salakhutdinov, Ruslan
Sim2real for robotic manipulation is difficult due to the challenges of simulating complex contacts and generating realistic task distributions. To tackle the latter problem, we introduce ManipGen, which leverages a new class of policies for sim2real transfer: local policies. Locality enables a variety of appealing properties including invariances to absolute robot and object pose, skill ordering, and global scene configuration. We combine these policies with foundation models for vision, language and motion planning and demonstrate SOTA zero-shot performance of our method to Robosuite benchmark tasks in simulation (97%). We transfer our local policies from simulation to reality and observe they can solve unseen long-horizon manipulation tasks with up to 8 stages with significant pose, object and scene configuration variation. ManipGen outperforms SOTA approaches such as SayCan, OpenVLA, LLMTrajGen and VoxPoser across 50 real-world manipulation tasks by 36%, 76%, 62% and 60% respectively. Video results at https://mihdalal.github.io/manipgen/
ComaDICE: Offline Cooperative Multi-Agent Reinforcement Learning with Stationary Distribution Shift Regularization
Bui, The Viet, Nguyen, Thanh Hong, Mai, Tien
Offline reinforcement learning (RL) has garnered significant attention for its ability to learn effective policies from pre-collected datasets without the need for further environmental interactions. While promising results have been demonstrated in single-agent settings, offline multi-agent reinforcement learning (MARL) presents additional challenges due to the large joint state-action space and the complexity of multi-agent behaviors. A key issue in offline RL is the distributional shift, which arises when the target policy being optimized deviates from the behavior policy that generated the data. This problem is exacerbated in MARL due to the interdependence between agents' local policies and the expansive joint state-action space. Prior approaches have primarily addressed this challenge by incorporating regularization in the space of either Q-functions or policies. In this work, we introduce a regularizer in the space of stationary distributions to better handle distributional shift. Our algorithm, ComaDICE, offers a principled framework for offline cooperative MARL by incorporating stationary distribution regularization for the global learning policy, complemented by a carefully structured multi-agent value decomposition strategy to facilitate multi-agent training. Through extensive experiments on the multi-agent MuJoCo and StarCraft II benchmarks, we demonstrate that ComaDICE achieves superior performance compared to state-of-the-art offline MARL methods across nearly all tasks. Over the years, deep RL has achieved remarkable success in various decision-making tasks (Levine et al., 2016; Silver et al., 2017; Kalashnikov et al., 2018; Haydari & Yılmaz, 2020). However, a significant limitation of deep RL is its need for millions of interactions with the environment to gather experiences for policy improvement.
- North America > United States > Oregon > Lane County > Eugene (0.14)
- Asia > Singapore (0.04)
- North America > United States > Ohio > Lucas County > Oregon (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.82)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.45)
Soft-QMIX: Integrating Maximum Entropy For Monotonic Value Function Factorization
Chen, Wentse, Huang, Shiyu, Schneider, Jeff
Multi-agent reinforcement learning (MARL) tasks often utilize a centralized training with decentralized execution (CTDE) framework. QMIX is a successful CTDE method that learns a credit assignment function to derive local value functions from a global value function, defining a deterministic local policy. However, QMIX is hindered by its poor exploration strategy. While maximum entropy reinforcement learning (RL) promotes better exploration through stochastic policies, QMIX's process of credit assignment conflicts with the maximum entropy objective and the decentralized execution requirement, making it unsuitable for maximum entropy RL. In this paper, we propose an enhancement to QMIX by incorporating an additional local Q-value learning method within the maximum entropy RL framework. Our approach constrains the local Q-value estimates to maintain the correct ordering of all actions. Due to the monotonicity of the QMIX value function, these updates ensure that locally optimal actions align with globally optimal actions. We theoretically prove the monotonic improvement and convergence of our method to an optimal solution. Experimentally, we validate our algorithm in matrix games, Multi-Agent Particle Environment and demonstrate state-of-the-art performance in SMAC-v2.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Maximum Entropy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Algorithms for learning value-aligned policies considering admissibility relaxation
Holgado-Sánchez, Andrés, Arias, Joaquín, Billhardt, Holger, Ossowski, Sascha
The emerging field of \emph{value awareness engineering} claims that software agents and systems should be value-aware, i.e. they must make decisions in accordance with human values. In this context, such agents must be capable of explicitly reasoning as to how far different courses of action are aligned with these values. For this purpose, values are often modelled as preferences over states or actions, which are then aggregated to determine the sequences of actions that are maximally aligned with a certain value. Recently, additional value admissibility constraints at this level have been considered as well. However, often relaxed versions of these constraints are needed, and this increases considerably the complexity of computing value-aligned policies. To obtain efficient algorithms that make value-aligned decisions considering admissibility relaxation, we propose the use of learning techniques, in particular, we have used constrained reinforcement learning algorithms. In this paper, we present two algorithms, $\epsilon\text{-}ADQL$ for strategies based on local alignment and its extension $\epsilon\text{-}CADQL$ for a sequence of decisions. We have validated their efficiency in a water distribution problem in a drought scenario.
- Education (0.46)
- Law (0.46)
- Government (0.46)
Federated Offline Policy Optimization with Dual Regularization
Yue, Sheng, Qin, Zerui, Hua, Xingyuan, Deng, Yongheng, Ren, Ju
Federated Reinforcement Learning (FRL) has been deemed as a promising solution for intelligent decision-making in the era of Artificial Internet of Things. However, existing FRL approaches often entail repeated interactions with the environment during local updating, which can be prohibitively expensive or even infeasible in many real-world domains. To overcome this challenge, this paper proposes a novel offline federated policy optimization algorithm, named $\texttt{DRPO}$, which enables distributed agents to collaboratively learn a decision policy only from private and static data without further environmental interactions. $\texttt{DRPO}$ leverages dual regularization, incorporating both the local behavioral policy and the global aggregated policy, to judiciously cope with the intrinsic two-tier distributional shifts in offline FRL. Theoretical analysis characterizes the impact of the dual regularization on performance, demonstrating that by achieving the right balance thereof, $\texttt{DRPO}$ can effectively counteract distributional shifts and ensure strict policy improvement in each federative learning round. Extensive experiments validate the significant performance gains of $\texttt{DRPO}$ over baseline methods.